In [ ]:
import numpy as np

Array views and slicing

A NumPy array is an object of numpy.ndarray type:


In [ ]:
a = np.arange(3)
type(a)

All ndarrays have a .base attribute. If this attribute is not None, then the array is a view of some other object's memory, typically another ndarray. This is a very powerful tool, because allocating memory and copying memory contents are expensive operations, but updating metadata on how to interpret some already allocated memory is cheap!

The simplest way of creating an array's view is by slicing it:


In [ ]:
a = np.arange(3)
a.base is None

In [ ]:
a[:].base is None

Let's look more closely at what an array's metadata looks like. NumPy provides the np.info function, which can list for us some low level attributes of an array:


In [ ]:
np.info(a)

By the end of the workshop you will understand what most of these mean. But rather than listen through a lesson, you get to try and figure what they mean yourself. To help you with that, here's a function that prints the information from two arrays side by side:


In [ ]:
def info_for_two(one_array, another_array):
    """Prints side-by-side results of running np.info on its inputs."""
    def info_as_ordered_dict(array):
        """Converts return of np.infor into an ordered dict."""
        import collections
        import io
        buffer = io.StringIO()
        np.info(array, output=buffer)
        data = (
            item.split(':') for item in buffer.getvalue().strip().split('\n'))
        return collections.OrderedDict(
            ((key, value.strip()) for key, value in data))
    one_dict = info_as_ordered_dict(one_array)
    another_dict = info_as_ordered_dict(another_array)
    name_w = max(len(name) for name in one_dict.keys())
    one_w = max(len(name) for name in one_dict.values())
    another_w = max(len(name) for name in another_dict.values())
    output =  (
        f'{name:<{name_w}} : {one:>{one_w}} : {another:>{another_w}}'
        for name, one, another in zip(
            one_dict.keys(), one_dict.values(), another_dict.values()))
    print('\n'.join(output))

Exercise 1.

  1. Create a one dimensional NumPy array with a few items (consider using np.arange).
  2. Compare the printout of np.info on your array and on slices of it (use the [start:stop:step] indexing syntax, and make sure to try steps other than one).
  3. Do you see any patterns?

In [ ]:
# Your code goes here

Exercise 1 debrief

Every array has an underlying block of memory assigned to it. When we slice an array, rather than making a copy of it, NumPy makes a view, reusing the memory block, but interpreting it differently.

Lets take a look at what NumPy did for us in the above examples, and make sense of some of the changes to info.

  • shape: for a one dimensional array shape is a single item tuple, equal to the total number of items in the array. You can get the shape of an array as its .shape attribute.
  • strides: is also a single item tuple for one-dimensional arrays, its value being the number of bytes to skip in memory to get to the next item. And yes, strides can be negative. You can get this as the .strides attribute of any array.
  • data pointer: this is the address in memory of the first byte of the first item of the array. Note that this doesn't have to be the same as the first byte of the underlying memory block! You rarely need to know the exact address of the data pointer, but it's part of the string representation of the arrays .data attribute.
  • itemsize: this isn't properly an attribute of the array, but of it's data type. It is the number of bytes that an array item takes up in memory. You can get this value from an array as the .itemsize attribute of its .dtype attribute, i.e. array.dtype.itemsize.
  • type: this lets us know how each array item should be interpreted e.g. for calculations. We'll talk more about this later, but you can get an array's type object through its .dtype attribute.
  • contiguous: this is one of several boolean flags of an array. Its meaning is a little more specific, but for now lets say it tells us whether the array items use the memory block efficiently, without leaving unused spaces between items. It's value can be checked as the .contiguous attribute of the arrays .flags attribute

Exercise 2

Take a couple or minutes to familiarize yourself with the NumPy array's attributes discussed above:

  1. Create a small one dimensional array of your choosing.
  2. Look at its .shape, .strides, .dtype, .flags and .data attributes.
  3. For .dtype and .flags, store them into a separate variable, and use tab completion on those to explore their subattributes.

In [ ]:
# Your code goes here

A look at data types

Similarly to how we can change the shape, strides and data pointer of an array through slicing, we can change how it's items are interpreted by changing it's data type. This is done by calling the array's .view() method, and passing it the new data type.

But before we go there, lets look a little closer at dtypes. You are hopefully familiar with the basic NumPy numerical data types:

Type Family NumPy Defined Types Character Codes
boolean np.bool '?'
unsigned integers np.uint8 - np.uint64 'u1', 'u2', 'u4', 'u8'
signed integers np.int8 - np.int64 'i1', 'i2', 'i4', 'i8'
floating point np.float16 - np.float128 'f2', 'f4', 'f8', 'f16'
complex np.complex64, np.complex128 'c8', 'c16'

You can create a new data type by calling its constructor, np.dtype(), with either a NumPy defined type, or the character code.

Character codes can have '<' or '>' prepended, to indicate whether the type is little or big endian. If unspecified, native encoding is used, which for all practical purposes is going to be little endian.

Exercise 3

Let's play a little with dtype views:

  1. Create a simple array of a type you feel comfortable you understand, e.g. np.arange(4, dtype=np.uint16).
  2. Take a view of type np.uint8 of your array. This will give you the raw byte contents of your array. Is this what you were expecting?
  3. Take a few views of your array, with dtypes of larger itemsize, or changing the endianess of the data type. Try to predict what the output will be before running the examples.
  4. Take a look at the wikipedia page on single precision floating point numbers, more specifically its examples of encodings. Create arrays of four np.uint8 values which, when viewed as a np.float32 give the values 1, -2, and 1/3.

In [ ]:
# Your code goes here

The Constructor They Don't Want You To Know About.

You typically construct your NumPy arrays using one of the many factory fuctions provided, np.array() being the most popular. But it is also possible to call the np.ndarray object constructor directly. You will typically not want to do this, because there are probably simpler alternatives. But it is a great way of putting your understanding of views of arrays to the test!

You can check the full documentation, but the np.ndarray constructor takes the following arguments that we care about:

  • shape: the shape of the returned array,
  • dtype: the data type of the returned array,
  • buffer: an object to reuse the underlying memory from, e.g. an existing array or its .data attribute,
  • offset: by how many bytes to move the starting data pointer of the returned array relative to the passed buffer,
  • strides: the strides of the returned array.

Exercise 4

Write a function, using the np.ndarray constructor, that takes a one dimensional array and returns a reversed view of it.


In [ ]:
# Your code goes here

Reshaping Into Higher Dimensions

So far we have sticked to one dimensional arrays. Things get substantially more interesting when we move into higher dimensions.

One way of getting views with a different number of dimensions is by using the .reshape() method of NumPy arrays, or the equivalent np.reshape() function.

The first argument to any of the reshape functions is the new shape of the array. When providing it, keep in mind:

  • the total size of the array must stay unchanged, i.e. the product of the values of the new shape tuple must be equal to the product of the values of the old shape tuple.
  • by entering -1 for one of the new dimensions, you can have NumPy compute its value for you, but the other dimensions must be compatible with the calculated one being an integer.

.reshape() can also take an order= kwarg, which can be set to 'C' (as the programming language) or 'F' (for the Fortran programming language). This correspond to row and column major orders, respectively.

Exercise 5

Let's look at how multidimensional arrays are represented in NumPy with an exercise.

  1. Create a small linear array with a total length that is a multiple of two different small primes, e.g. 6 = 2 * 3.
  2. Reshape the array into a two dimensional one, starting with the default order='C'. Try both possible combinations of rows and columns, e.g. (2, 3) and (3, 2). Look at the resulting arrays, and compare their metadata. Do you understand what's going on?
  3. Try the same reshaping with order='F'. Can you see what the differences are?
  4. If you feel confident with these, give a higher dimensional array a try.

In [ ]:
# Your code goes here

Exercise 5 debrief

As the examples show, an n-dimensional array will have an n item tuple .shape and .strides. The number of dimensions can be directly queried from the .ndim attribute.

The shape tells us how large the array is along each dimension, the strides tell us how many bytes to skip in memory to get to the next item along each dimension.

When we reshape an array using C order, a.k.a. row major order, items along higher dimensions are closer in memory. When we use Fortran orser, a.k.a. column major order, it is items along smaller dimensions that are closer.

Reshaping with a purpose

One typical use of reshaping is to apply some aggregation function to equal subdivision of an array.

Say you have, e.g. a 12 item 1D array, and would like to compute the sum of every three items. This is how this is typically accomplished:


In [ ]:
a = np.arange(12, dtype=float)
a

In [ ]:
a.reshape(4, 3).sum(axis=-1)

You can apply fancier functions than .sum(), e.g. let's compute the variance of each group:


In [ ]:
a.reshape(4, 3).var(axis=-1)

Exercise 6

Your turn to do a fancier reshaping: we will compute the average of a 2D array over non-overlapping rectangular patches:

  1. Choose to small numbers m and n, e.g. 3 and 4.
  2. Create a 2D array, with number of rows a multiple of one of those numbers, and number of columns a multiple of the other, e.g. 15 x 24.
  3. Reshape and aggregate to create a 2D array holding the sums over non overlapping m x n tiles, e.g. a 5 x 6 array.
  4. Hint: .sum() can take a tuple of integers as axis=, so you can do the whole thing in a single reshape from 2D to 4D, then aggregate back to 2D. If tyou find this confusing, doing two aggregations will also work.

In [ ]:
# Your code goes here

Rearranging dimensions

Once we have a multidimensional array, rearranging the order of its dimensions is as simple as rearranging its .shape and .tuple attributes. You could do this with np.ndarray, but it would be a pain. NumPy has a bunch of functions for doing that, but they are all watered down versions of np.transpose, which takes a tuple with the desired permutation of the array dimensions.

Exercise 7

  1. Write a function roll_axis_to_end that takes an array and an axis, and makes that axis the last dimension of the array.
  2. For extra credit, rewrite your function using np.ndarray.

In [ ]:
# Your code goes here

Playing with strides

For the rest of the workshop we are going to dome some fancy tricks with strides, to create interesting views of an existing array.

Exercise 8

Create a function to extract the diagonal of a 2-D array, using the np.ndarray constructor.


In [ ]:
# Your code goes here

Exercise 9

  1. Something very interesting happens when we set a stride to zero. Give that idea some thought and then:
  2. Create two functions, stacked_column_vector and stacked_row_vector, that take a 1D array (the vector), and an integer n, and create a 2D view of the array that stack n copies of the vector, either as columns or rows of the view.
  3. Use this functions to create an outer_product function that takes two 1D vectors and computes their outer product.

In [ ]:
# Your code goes here

Exercise 10

In the last exercise we used zero strides to reuse an item more than once in the resulting view. Let's try to build on that idea:

  1. Write a function that takes a 1D array and a window integer value, and creates a 2D view of the array, each row a view through a sliding window of size window into the original array.
  2. Hint: There are len(array) - window + 1 such "views through a window".
  3. Another hint: Here's a small example expected run:

    >>> sliding_window(np.arange(4), 2) [[0, 1], [1, 2], [2, 3]]


In [ ]:
# Your code goes here

Parting pro tip

NumPy's worst kept secret is the existence of a mostly undocumented, mostly hidden, as_strided function, that makes creating views with funny strides much easier (and also much more dangerous!) than using np.ndarray. Here's the available documentation:


In [ ]:
from numpy.lib.stride_tricks import as_strided

np.info(as_strided)

Note that this function will not protect you, the way np.ndarray does, from accessing memory that is not indexed by the array the view is taken for. You may want to do that, but be wary of the world of segmentation faults you are getting yourself into!